We collected the data by online survey on ed, DATA1001 discord and r/USYD
DATA CLEANING:
We removed the row that was impossible (7000 meals a week). We changed the values that seemed like mistakes in units (1.62 m and 1.71 m were changed into centimeters; one row with 90, 180, 90 minutes of exercise was changed into hours). We marked the empty cells with -9.
BMI was calculated as \(weight/height^2\), with units of kg/m2 (Body Mass Index (BMI) Calculator, 2021)
library(tidyverse)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.0.5
library(plotly)
# Read in your survey data
survey = read.csv("Survey - How does the type of diet and Exercise Affect BMI _.csv")
#removing the variable Timestamp
survey = select(survey, -Timestamp)
#fixing the names
my_names <- c("Gender",
"Weight",
"Height",
"LowIntensity",
"ModerateIntensity",
"HighIntensity",
"ProteinMeals",
"CarbMeals",
"VegMeals",
"Dessert",
"ProcessedFood")
names(survey) <- my_names
#creating the variable BMI
survey <-within(survey,BMI <- (Weight)/(Height*0.01)^2)
#looking at the outliers
#ggplot(survey, aes(y = ProteinMeals + CarbMeals + VegMeals, x = )) +
# geom_boxplot()
#no lower outliers, two upper outliers
Structure of our Data
str(survey)
## 'data.frame': 43 obs. of 12 variables:
## $ Gender : chr "Male" "Male" "Female" "Male" ...
## $ Weight : num 60 82 45 70 55 52 63 45 135 127 ...
## $ Height : num 174 172 160 162 158 ...
## $ LowIntensity : num 6 6 2 2 4 6 2 8 0 1 ...
## $ ModerateIntensity: int 2 2 1 4 2 2 1 0 0 5 ...
## $ HighIntensity : num 6 2 0 0 0 1 0 0 0 5 ...
## $ ProteinMeals : int 12 19 11 10 18 4 10 5 5 12 ...
## $ CarbMeals : int 8 10 12 7 14 3 2 4 8 5 ...
## $ VegMeals : int 7 9 3 1 14 5 2 8 4 4 ...
## $ Dessert : int 5 3 0 3 1 1 0 15 12 2 ...
## $ ProcessedFood : int 10 1 4 3 1 0 1 15 12 5 ...
## $ BMI : num 19.8 27.7 17.6 26.7 22.2 ...
Numerical Summaries
summary(survey)
## Gender Weight Height LowIntensity
## Length:43 Min. : 44.00 Min. :150.0 Min. : 0.000
## Class :character 1st Qu.: 54.00 1st Qu.:163.0 1st Qu.: 2.000
## Mode :character Median : 60.00 Median :169.0 Median : 2.000
## Mean : 65.15 Mean :168.8 Mean : 3.651
## 3rd Qu.: 70.00 3rd Qu.:175.0 3rd Qu.: 5.000
## Max. :135.00 Max. :183.0 Max. :24.000
## ModerateIntensity HighIntensity ProteinMeals CarbMeals
## Min. :0.000 Min. :0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.: 6.000 1st Qu.: 6.000
## Median :1.000 Median :0.000 Median : 8.000 Median : 7.000
## Mean :1.837 Mean :1.198 Mean : 9.047 Mean : 8.465
## 3rd Qu.:2.500 3rd Qu.:1.250 3rd Qu.:11.000 3rd Qu.:11.000
## Max. :9.000 Max. :7.000 Max. :21.000 Max. :21.000
## VegMeals Dessert ProcessedFood BMI
## Min. : 0.000 Min. : 0.000 Min. :-9.000 Min. :16.33
## 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 1.500 1st Qu.:19.08
## Median : 5.000 Median : 4.000 Median : 3.000 Median :21.31
## Mean : 6.279 Mean : 5.326 Mean : 3.674 Mean :22.78
## 3rd Qu.: 8.500 3rd Qu.: 7.000 3rd Qu.: 5.000 3rd Qu.:23.70
## Max. :21.000 Max. :19.000 Max. :15.000 Max. :46.11
#cor(survey[ , -1])
#plot(survey)
We created the variables for ratio, which was Meal Count of Type/Total Meal Count for major meals.
#creating the meal type ratio variables.
#Type of Meal Count / Total Meal Count
ratios <- survey %>%
mutate(CRatio = CarbMeals/(ProteinMeals+CarbMeals + VegMeals)) %>%
mutate(PRatio = ProteinMeals/(ProteinMeals+CarbMeals + VegMeals)) %>%
mutate(VRatio = VegMeals/(ProteinMeals+CarbMeals + VegMeals)) %>%
select(Gender, BMI, CRatio, VRatio, PRatio)
#tidier for ggplot
longratios <- ratios %>% pivot_longer(c("CRatio", "VRatio", "PRatio"), names_to = "RatioType", values_to = "MealRatio")
summary(longratios)
## Gender BMI RatioType MealRatio
## Length:129 Min. :16.33 Length:129 Min. :0.0000
## Class :character 1st Qu.:19.05 Class :character 1st Qu.:0.2353
## Mode :character Median :21.31 Mode :character Median :0.3333
## Mean :22.78 Mean :0.3333
## 3rd Qu.:23.73 3rd Qu.:0.4286
## Max. :46.11 Max. :1.0000
str(longratios)
## tibble [129 x 4] (S3: tbl_df/tbl/data.frame)
## $ Gender : chr [1:129] "Male" "Male" "Male" "Male" ...
## $ BMI : num [1:129] 19.8 19.8 19.8 27.7 27.7 ...
## $ RatioType: chr [1:129] "CRatio" "VRatio" "PRatio" "CRatio" ...
## $ MealRatio: num [1:129] 0.296 0.259 0.444 0.263 0.237 ...
LIMITATIONS
Firstly, our data was biased as there were more females than males, and people possibly undercounted the amount of vegetables and carbohydrates they eat, as they are often eaten as side dishes.
#to check for bias in our sample.
ggplot(survey ,aes(x = Gender)) + geom_bar() + ylab("")+ ggtitle("Participants Divided by Gender")
Secondly, error was introduced by the different meal sizes between individuals. Total calories consumed would have been a valuable variable but it would have been hard to measure by individuals. Also, it might be hard to classify each type of food correctly, for example, sausages.
Females had a higher average BMI and also the maximum BMI. Neither graph was normal, and both were skewed left, this implies our data was too small.
ggplot(survey,aes(x=BMI,fill=Gender,))+geom_histogram(binwidth=1,color="black")+facet_grid(~Gender)+theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
Protein based meals had the highest median consumption, while vegetable based meals had the lowest. Spreads were similar. It’s interesting that 0 is not an outlier.
NewSurvey = survey %>% select(CarbMeals, ProteinMeals, VegMeals)
nsrvynames = c("Carbohydrate Based Meals",
"Protein Based Meals",
"Vegetable Based Meals")
names(NewSurvey) = nsrvynames
NewSurvey <- NewSurvey %>% pivot_longer(c("Carbohydrate Based Meals",
"Protein Based Meals",
"Vegetable Based Meals"), names_to = "MealType", values_to = "MealCount")
ya <- list(title = "", titlefont = F )
xa <- list(title = "Weekly Meal Count", titlefont = F )
#https://plotly.com/r/figure-labels/
#(*Axes Labels in R*)
p4 = plot_ly(NewSurvey, x = ~MealCount, y = ~MealType, color = ~MealType, type = 'box')
p4 <- p4%>% layout(showlegend = F,
title = "Weekly Meal Counts by Type",
xaxis = xa,
yaxis = ya)
p4
Correlation between Meal Ratio’s and BMI and their R2 values respectively
Protein Ratios
#Protein Ratios and BMI
PrBMI = lm(BMI ~ PRatio, data = ratios)
cor(ratios$BMI, ratios$PRatio)
## [1] 0.4243385
summary(PrBMI)$r.squared
## [1] 0.1800632
Carbohydrate Ratios
CrBMI = lm(BMI ~ CRatio, data = ratios)
cor(ratios$BMI, ratios$CRatio)
## [1] -0.3358918
summary(CrBMI)$r.squared
## [1] 0.1128233
Vegetable Ratios
VrBMI = lm(BMI ~ VRatio, data = ratios)
cor(ratios$BMI, ratios$VRatio)
## [1] -0.2339149
summary(VrBMI)$r.squared
## [1] 0.05471616
ggplot(longratios, aes(x = MealRatio, y = BMI, colour = RatioType, shape = RatioType)) +
geom_point() +
geom_smooth(method = "lm", se = F)+
xlab("Ratio of Each Type") +ggtitle("Ratios of a Meal Count of a Given Type to Total Meal Count, in Relation to BMI")+
scale_colour_discrete(name = "Meal Types",
breaks = c("CRatio", "PRatio", "VRatio"),
labels = c("Carbohydrate Based", "Protein Based", "Vegetable Based"))+
scale_shape_discrete(name = "Meal Types",
breaks = c("CRatio", "PRatio", "VRatio"),
labels = c("Carbohydrate Based", "Protein Based", "Vegetable Based"))+
theme(legend.position = "bottom")
## `geom_smooth()` using formula 'y ~ x'
There was a weak positive correlation between protein meal ratio and BMI (R2 = 0.1800632). There was an even weaker negative correlation between carbohydrate meal ratio and BMI (R2 = 0.1128233). There was no correlation between vegetable meal ratio and BMI (R2 = 0.0547162). The negative trend of carbohydrate is supported by the literature (Merchant et al., 2009), and it is possible the positive trend of protein is related to meat consumption.
ggplot(CrBMI) +
geom_point(aes(x=.fitted, y=.resid))+
#source: https://community.rstudio.com/t/ggplot-makes-residual-plots/738
#(atiretoo, 2017)
geom_abline(slope = 0, colour = "grey", alpha = 0.8)+
# geom_smooth(aes(x=.fitted, y=.resid), se = F, colour = "Black")+
ylab("Residuals") + xlab("Fitted Values") + ggtitle("Carbohydrate Ratio and BMI Residual Plot")
#they are random.
ggplot(PrBMI) +
geom_point(aes(x=.fitted, y=.resid))+
geom_abline(slope = 0, colour = "grey", alpha = 0.8)+
# geom_smooth(aes(x=.fitted, y=.resid), se = F, colour = "Black")+
ylab("Residuals") + xlab("Fitted Values") + ggtitle("Protein Ratios Residual Plot")
#they are random.
A linear model is acceptable for carbohydrate ratio, but is not ideal for protein ratio, as the residuals have a downward trend, and are not homoscedastic.
#housekeeping for this section
snacks = survey %>% filter(ProcessedFood > 0)
newsnacks = survey %>% filter(ProcessedFood > 0) %>% pivot_longer(c("Dessert", "ProcessedFood"), names_to = "snacktype", values_to = "amount")
#looking at if the snacks consumption is internally related.
ggplot(snacks, aes(x = ProcessedFood, y = Dessert)) +
geom_point() +
geom_smooth(method = "lm", colour = "Black", se = F)+
xlab("Number of Times Proccesed Food Was Eaten Weekly") + ylab("Number of Dessert Protions Consumed Weekly") + ggtitle("Snacks in Relation to Each Other")
## `geom_smooth()` using formula 'y ~ x'
#linear model
DP = lm(Dessert ~ ProcessedFood, data = snacks)
We see that a higher consumption of one kind of snack (among desserts or processed food) somewhat correlates (R2 = 0.3469671) with a higher consumption of the other, implying the possibility of a common cause, such as habits.
#residuals
ggplot(DP) +
geom_point(aes(x=.fitted, y=.resid))+
geom_abline(slope = 0, colour = "grey", alpha = 0.8)+
ylab("Residuals") + xlab("Fitted Values") + ggtitle("Residual Plot")
#they are not random.
The residual plot is random; thus the linear model is acceptable.
#Looking if there is a relationship between snack consumption and BMI
ggplot(newsnacks, aes(x = amount , y = BMI, colour = snacktype, shape = snacktype)) +
geom_point() +
xlab("Snacks Consumed (Servings/Week)" ) + ggtitle("BMI in Relation to Snack Consumption")+
scale_colour_discrete(name = "Snack Types",
breaks = c("Dessert", "ProcessedFood"),
labels = c("Servings of dessert", "Servings of processed food"))+
scale_shape_discrete(name = "Snack Types",
breaks = c("Dessert", "ProcessedFood"),
labels = c("Servings of dessert", "Servings of processed food"))+
theme(legend.position = "bottom")
Correlation between weekly dessert consumption and BMI and its R2
#linear model of Bmi and desserts
DBMI = lm(BMI ~ Dessert, data = snacks)
#correlation
cor(snacks$BMI, snacks$Dessert)
## [1] -0.05235037
#r^2
summary(DBMI)$r.squared
## [1] 0.002740561
Weekly processed food consumption and BMI’s correlation and its R2
#linear model of Bmi and processed food
PFBMI = lm(BMI ~ ProcessedFood, data = snacks)
#correlation
cor(snacks$BMI, snacks$ProcessedFood)
## [1] 0.07654133
#r^2
summary(PFBMI)$r.squared
## [1] 0.005858575
There is no significant linear correlation between either processed food (R2 = 0.0058586) or desserts (R2 = 0.0027406) and BMI. Others have found a negative relation (Ritchie, 2012).
Although more calories consumed leads to higher BMI (Togo et al., 2001), more meal count does not necessarily imply that.
ggplot(survey,aes(main = "Body Mass Index by Type of Exercise", y = BMI))+
geom_point(aes(x = LowIntensity, colour = "Low Intensity "),size = 3, shape = 20, show.legend = T) +
geom_point(aes(x= ModerateIntensity, colour = "Moderate Intensity"),size = 3, shape = 17, show.legend = T)+
geom_point(aes(x= HighIntensity, colour = "High Intensity"),size = 2.5, shape = 15, show.legend = T) +
labs(title = "Body Mass Index by Type of Exercise", x = "Hours of Exercise by Type (per week)", y = "Body Mass Index", colour = "Exercise Type")
R2 Values
High Intensity
HBMI = lm(HighIntensity ~ BMI, data = survey)
summary(HBMI)$r.squared
## [1] 0.005816775
Moderate Intensity
MBMI = lm(ModerateIntensity ~ BMI, data = survey)
summary(MBMI)$r.squared
## [1] 6.855439e-06
Low Intensity
LBMI = lm(LowIntensity ~ BMI, data = survey)
summary(LBMI)$r.squared
## [1] 0.04629996
Exercise duration regardless of intensity was not correlated with BMI. Low intensity was most correlated (R2 = 0.0463), then came high intensity (R2 = 0.0058168), and finally the moderate intensity was impressively uncorrelated (R2 = 6.8554394^{-6}).
atiretoo. (2017, 16/09/2017). Ggplot() makes residual plots?! Retrieved 25/04/2021 from https://community.rstudio.com/t/ggplot-makes-residual-plots/738
Axes Labels in R. Plotly. Retrieved 25 April from https://plotly.com/r/figure-labels/
Body Mass Index (BMI) Calculator. (2021, 23 April). Canadian Diabetes Association. Retrieved 24 April from https://www.diabetes.ca/managing-my-diabetes/tools—resources/body-mass-index-(bmi)-calculator#:~:text=Body%20Mass%20Index%20is%20a,range%20is%2018.5%20to%2024.9.
Merchant, A. T., Vatanparast, H., Barlas, S., Dehghan, M., Shah, S. M. A., De Koning, L., & Steck, S. E. (2009). Carbohydrate intake and overweight and obesity among healthy adults. Journal of the American Dietetic Association, 109(7), 1165-1172.
Ritchie, L. D. (2012). Less frequent eating predicts greater BMI and waist circumference in female adolescents. The American journal of clinical nutrition, 95(2), 290-296.
Togo, P., Osler, M., Sørensen, T., & Heitmann, B. (2001). Food intake patterns and body mass index in observational studies. International journal of obesity, 25(12), 1741-1751.
(A):510098984 (R):500599576 (S):510221870
Every member contributed evenly in the sections 1 and 2.3. .
2.1’s dplyr section was done by A and the rest by R, but decisions and writing was done collectively.
A worked on sections 2.2.3 and 2.4, and helped with dplyr and general formatting.
R worked 2.2.4 and more on 2.1
S worked on 2.2.1 and 2.2.2
All 3 members came to all the meetings.
03/04/2021: The decision on the topic and the creation of the survey. Took about 2 hours. Next meeting was decided to be the week 6 lab class, to be able to have data to work with. Each person gave some ideas and then we evaluated them based on feasibility when it came to our data collection (survey).
13/04/2021: While cleaning the data, there were some illogical answers, and we needed to calculate BMI.
Had a discussion of possible research questions:
What’s the difference between the genders when it comes to BMI?
BMI-diet?
BMI-exercise?
Protein vs HighExercise (linear)
17/04/2021 We wrote the IDA and did the data cleanup, and had a discussion on the weaknesses of the study. We then divided the questions, not all questions that we had brainstormed were used.
20/04/2021 Discussion about how we were answering the questions.
24/04/2021 We got all the plots into one place, wrote 2.3., edited overall to shorten the length and for presentation.
27/04/2021 Final touches, lots of grammar edits, and video recording on zoom.